NONPARAMETRIC TESTING FOR HETEROGENEOUS CORRELATION

Abstract. In the presence of weak overall correlation, it may be useful to
investigate if the correlation is significantly and substantially more pronounced
over a subpopulation. Two different testing procedures are compared. Both
are based on the rankings of the values of two variables from a data set with a
large number n of observations. The first maintains its level against Gaussian
copulas; the second adapts to general alternatives in the sense that the
number of parameters used in the test grows with n. An analysis of wine
quality illustrates how the methods detect heterogeneity of association between
chemical properties of the wine, which is attributable to a mix of different
cultivars.


Introduction 

The goal of this paper is to offer new methods for discovering association between 
two variables that is supported only in a subpopulation. For example, while higher 
counts of HDLs are generally associated with lower risk of myocardial infarction, 
researchers (Voight et al., 2012; Katz, 2014) have found subpopulations that do not 
adhere to this trend. In marketing, subpopulations of designated marketing areas 
(DMAs) in the US respond differentially to TV advertising campaigns, and the 
identification of DMAs that are sensitive to ad exposure enables efficient spending 
of ad dollars. In preclinical screening of potential drugs, various subpopulations of 
chemicals elicit concomitant responses from sets of hepatocyte genes, which can be 
used to discover gene networks that break down classes of drugs, without having to
pre-specify how the classes are formed. The new methods thus lead to a whole new 
approach to analysis of large data sets. 

When covariates are available, regression analysis classically attempts to identify
a supporting subpopulation via interaction effects, but these may be difficult
to interpret properly. In the presence of overall correlation, it may be useful to
investigate directly if the correlation is significantly and substantially more
pronounced over a subpopulation. This becomes feasible when representatives of
supporting subpopulations are embedded in large samples. The novel statistical tests
described in this paper are designed to probe large samples to ascertain if there is
such a subpopulation.

The general setting is this: A large number n of observations are sampled from 
a bivariate continuous distribution. The basic assumption is that the population 
consists of two subpopulations. In one, the two variables are positively (or
negatively) associated; in the other, the two variables are independent. While some
distributional assumptions are required even to define the notion of homogeneous
association, the underlying intent is to make the tests robust to assumptions about
the distributions governing both the null and alternative hypotheses.

Notation for the rest of the paper is as follows: Let X ~ F and Y ~ G have joint,
continuous distribution H. For any sample $\{(x_i, y_i) \mid i = 1, \dots, n\}$, the empirical
marginal distributions are defined by

\[ F_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i \le x\} \quad\text{and}\quad G_n(y) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{y_i \le y\}. \]

The ranking $\pi$ of the sample $\{x_i \mid i = 1, \dots, n\}$ is the function
$\pi : \{x_i \mid i = 1, \dots, n\} \to \{1, \dots, n\}$ defined by

\[ \pi(x_i) = \sum_{j=1}^{n} \mathbf{1}\{x_j \le x_i\}. \]

The corresponding ranking of $\{y_i \mid i = 1, \dots, n\}$ is denoted by $\nu$. Spearman’s
footrule distance with a sample $\{(x_i, y_i) \mid i = 1, \dots, n\}$ is defined through the
sample rankings as

\[ d_S = \sum_{i=1}^{n} |\pi(x_i) - \nu(y_i)|. \]
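As a minimal sketch of the ranking and footrule definitions above (assuming no ties, as is the case almost surely for continuous data; the function names are illustrative and do not come from the paper):

```python
def ranking(values):
    """Rank of each value, pi(x_i) = #{j : x_j <= x_i}, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def footrule(x, y):
    """Spearman's footrule: sum over the sample of |pi(x_i) - nu(y_i)|."""
    pi, nu = ranking(x), ranking(y)
    return sum(abs(a - b) for a, b in zip(pi, nu))


x = [0.3, 1.7, 0.9, 2.4]
y = [10.0, 30.0, 40.0, 20.0]
print(ranking(x), ranking(y), footrule(x, y))
```

Only the orderings of the x and y values enter the computation, so any strictly increasing transformation of either variable leaves the result unchanged.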

The Kendall distance associated with the sample is defined as

\[
\begin{aligned}
D_K([x_1, \dots, x_n], [y_1, \dots, y_n]) &= \sum_{i<j} \mathbf{1}\{(x_i - x_j)(y_i - y_j) < 0\} \\
&= \sum_{i<j} \mathbf{1}\{(\pi(x_i) - \pi(x_j))(\nu(y_i) - \nu(y_j)) < 0\} \\
&= d_K(\pi, \nu),
\end{aligned}
\]

which depends only on the rankings $\pi$ and $\nu$ of the samples $\{x_i\}$ and $\{y_i\}$. Mallows'
(1957) model for rankings takes the form

\[ P_\phi(\nu \mid \pi) = C(\phi)\, e^{-\phi\, d_K(\pi, \nu)} \]

where the normalizing constant $C(\phi)$ has a tractable form (Fligner and Verducci,
1986) known as a Poincaré polynomial (Diaconis and Graham, 2000). Distributional
forms for the data are in terms of copulas:

\[ C_H(F(X), G(Y)) = H(X, Y) \]


which are distribution functions on the unit square, having uniform margins. Two 
copulas play a fundamental role in motivating the tests: the Gaussian copula and 
the Frank copula. If (X, Y) has a bivariate normal distribution H with correlation
$\rho$, then its corresponding copula is

\[ C_\rho(u, v) = \Phi_2\left(\Phi^{-1}(u), \Phi^{-1}(v); \rho\right) \]

where $\Phi$ is the standard normal CDF. The bivariate distributions $C_\rho$ and $\Phi_2$ are
indexed solely by the underlying correlation $\rho$. The Frank copula (Frank, 1979;
Genest, 1987) has the form

\[ C_\theta(u, v) = -\frac{1}{\theta} \log\left(1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right). \]
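For concreteness, the Frank copula and the inverse of its conditional distribution (the standard route to sampling from it) can be sketched as follows. This is an illustration under the stated formula, not code from the paper; the function names are hypothetical.

```python
import math


def frank_copula(u, v, theta):
    """Frank copula C_theta(u, v) for theta != 0."""
    numerator = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(numerator / math.expm1(-theta)) / theta


def frank_conditional_inverse(u, w, theta):
    """Return v such that dC_theta/du (u, v) = w, i.e. P(V <= v | U = u) = w."""
    a = w * math.expm1(-theta) / (math.exp(-theta * u) - w * math.expm1(-theta * u))
    return -math.log1p(a) / theta


# Uniform margins: C(u, 1) = u; near-independence as theta -> 0: C(u, v) ~ u * v.
print(frank_copula(0.3, 1.0, 3.0), frank_copula(0.4, 0.7, 1e-6))
```

Drawing u and w independently from Uniform(0, 1) and setting v = frank_conditional_inverse(u, w, theta) yields a pair (u, v) from the copula.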



The next two sections describe two new tests for detecting subpopulations that
support association: the Components of Spearman’s Footrule (CSF) test and the
Components of Kendall’s Tau (CKT) test. The CSF test is scaled according to
a Gaussian copula and the CKT test is scaled according to a Frank copula. The
CSF test is computationally fast, and the CKT test adapts to a large variety of
alternatives. The following two sections cover their performance under simulations.
Concluding remarks are in the last section.


Components of Spearman’s Footrule (CSF) 

While Spearman’s footrule (Diaconis and Graham, 1977) measures the overall 
disarray in a sample, the distribution of individual absolute rank differences 

\[ d_i = |\pi(x_i) - \nu(y_i)| \]

proves to be very useful in detecting subsamples with distinctly less disarray than 
would be expected under homogeneous association. Because the rankings depend 
on the whole sample, the {di} are not independent. Nevertheless, we loosely define 
their empirical distribution as 

\[ S_n(d) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{d_i \le d\}. \]

As a step toward determining asymptotic forms for this distribution, we offer the 
following lemmas. 


Lemma 1. For any sample $\{(X_i, Y_i) \mid i = 1, \dots, n\}$ from a joint distribution H
with compact support, let (X, Y) be a new, independently sampled observation.
Then, for rankings $\pi$ and $\nu$ for the extended sample of n + 1 observations,

\[ \left(\frac{\pi(X)}{n+1}, \frac{\nu(Y)}{n+1}\right) \to [F(X), G(Y)] \quad \text{a.s.} \]

and its asymptotic distribution is the underlying copula $C_H[F(X), G(Y)]$ of H.


Lemma 2. Under independence, the asymptotic distribution of the scaled absolute
rank differences

\[ \left| \frac{\pi(X)}{n+1} - \frac{\nu(Y)}{n+1} \right| \]

is Beta(1, 2).
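Lemma 2 is easy to check by simulation. The sketch below is an illustration only (sample size and seed are arbitrary choices): it compares the empirical distribution of the scaled absolute rank differences under independence with the Beta(1, 2) distribution function $1 - (1 - d)^2$.

```python
import random


def scaled_rank_differences(x, y):
    """Scaled absolute rank differences |pi(x_i) - nu(y_i)| / (n + 1)."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for position, index in enumerate(order, start=1):
            r[index] = position
        return r

    pi, nu = ranks(x), ranks(y)
    return [abs(a - b) / (n + 1) for a, b in zip(pi, nu)]


def beta_1_2_cdf(d):
    """Distribution function of Beta(1, 2) on [0, 1]."""
    return 1.0 - (1.0 - d) ** 2


rng = random.Random(0)
n = 5000
x = [rng.random() for _ in range(n)]
y = [rng.random() for _ in range(n)]
diffs = scaled_rank_differences(x, y)
for d in (0.1, 0.2, 0.5):
    empirical = sum(value <= d for value in diffs) / n
    print(d, round(empirical, 3), round(beta_1_2_cdf(d), 3))
```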

Proposition 3. Under a Gaussian($\rho$) copula, $S_n$ converges to a Beta(1, $\beta(\rho)$)
distribution.


Although we do not have a formal proof for this proposition, many simulations
with n = 1000 affirm the proposition and produce a smooth curve for $\beta(\rho)$. See
Figure 1 for one such example.

The null hypothesis is that (X, Y) have a Gaussian copula. The alternative is
that (X, Y) come from a mixture of two subpopulations in which under one they
are independent, and under the other they are positively associated. To test for
negative association, simply replace Y by -Y. No particular form is assumed for
the positively associated subpopulation, but it is informative to examine the case
where this component is Gaussian. Figure 2 illustrates $S_n$ and its histogram under
such a mixture.

Because the differences in distributions under the null and alternative are small,
large samples are required to distinguish the two. As noted from the histogram
in Figure 2, most of the distinguishing information is contained at the low end
of the distribution. This makes sense because a subpopulation supporting positive
association should have a surplus of points where the ranks of X and Y closely agree.
Thus a test statistic based on absolute rank differences should emphasize the lower
order statistics. Such statistics come under the heading of L-statistics. It is possible
to tailor a test toward alternative features of interest such as proportionate size of
the subpopulation and the strength of association within it. Exact distributions of
partial or weighted sums of absolute rank differences are quite complicated due to
dependencies (Sen et al., 2011), even under the null hypothesis of independence.

Figure 1. Left panel shows the close agreement of $S_n$ with
Beta(1, $\beta(\rho)$) when sampling n = 10,000 observations from a
Gaussian($\rho$ = 0.2) copula. The right panel illustrates the $\beta(\rho)$
curve for $0 < \rho < 0.5$.

Figure 2. (Standardized) Distribution of Rank Differences Under
Mixture of Gaussian(0.6) and Independent Copulas with overall
correlation 0.3 compared to Beta(1, 2.65).

A very simple general purpose test statistic is

\[ T_S = \sum_{i=1}^{n} \mathbf{1}\{ |\pi(x_i) - \nu(y_i)| < 0.2 \}, \]

where the absolute rank differences are understood on the unit-interval scale of
Lemma 2.


Using the observed overall correlation r in place of $\rho$, the null distribution of $T_S$
may be simulated under the Gaussian copula or approximated as a Binomial test
statistic using the probability from the Beta(1, $\beta$) as in Proposition 3. In the latter
case, ignoring weak dependencies, the .05 level test has 80% power to detect
a Gaussian subpopulation of 25% with r = .8 for n = 1000.
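The simulation route to the null distribution can be sketched as below. This is a hedged illustration, not the authors' implementation. It assumes that rank differences are scaled by n + 1 as in Lemma 2, that r is the Pearson correlation of the raw values, and that large values of the statistic are evidence for the alternative. All function names are hypothetical.

```python
import math
import random


def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, index in enumerate(order, start=1):
        r[index] = position
    return r


def csf_statistic(x, y, cutoff=0.2):
    """Count of scaled absolute rank differences below the cutoff.

    Assumption: differences are divided by n + 1, as in Lemma 2.
    """
    n = len(x)
    return sum(abs(a - b) / (n + 1) < cutoff for a, b in zip(ranks(x), ranks(y)))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def csf_pvalue(x, y, n_sim=200, seed=0):
    """Monte Carlo p-value; the null is a Gaussian copula with rho = observed r."""
    rng = random.Random(seed)
    n, r = len(x), pearson(x, y)
    scale = math.sqrt(max(0.0, 1.0 - r * r))
    observed = csf_statistic(x, y)
    exceed = 0
    for _ in range(n_sim):
        zx = [rng.gauss(0, 1) for _ in range(n)]
        zy = [r * a + scale * rng.gauss(0, 1) for a in zx]
        exceed += csf_statistic(zx, zy) >= observed
    return (exceed + 1) / (n_sim + 1)
```

Because the statistic depends on the data only through ranks, simulating bivariate normal pairs is equivalent to simulating from the Gaussian copula itself.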


Components of Kendall’s Tau (CKT) 

Although the CSF test is both simple and computationally efficient, it has a 
conceptual shortcoming arising from the use of Spearman’s footrule distance to 
characterize association in a subpopulation. The issue is that the components of 
the footrule distance in the subpopulation depend on the encompassing population; 
that is, when the sample is a full population, with associated subpopulation $\Omega$, the
component set from the footrule from $\Omega$

\[ \{d_i \mid i \in \Omega\} = \{ |\pi(x_i) - \nu(y_i)| \mid i \in \Omega \} \]

depends heavily on the rankings $\pi$ and $\nu$ determined by the full population. In
contrast, the component set from Kendall’s distance depends only on the relative
rankings within $\Omega$, which may be constructed from just the original values in $\Omega$.
That is,

\[ \{ \mathbf{1}[(\pi(x_i) - \pi(x_j))(\nu(y_i) - \nu(y_j)) < 0] \mid i, j \in \Omega \} = \{ \mathbf{1}[(x_i - x_j)(y_i - y_j) < 0] \mid i, j \in \Omega \}. \]

Thus the subpopulation discordances (components of Kendall’s distance) do not
depend upon the embedding population, whereas the subpopulation disarray
(components of Spearman’s footrule distance) does. This invariance has a number of
beneficial properties, such as allowing the CKT test to retain power in situations where
the ranges of the $\{X_i\}$ and $\{Y_i\}$ values in the subpopulation are more restricted
than those in the full population.
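A small numerical illustration of this invariance follows. The data are toy values invented for the example. Discordance indicators within a subset are the same whether computed from raw values or from full-sample ranks, whereas footrule components change when the subset is ranked on its own.

```python
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, index in enumerate(order, start=1):
        r[index] = position
    return r


def discordances(x, y, subset):
    """Discordance indicators for all pairs i < j taken within the subset."""
    return [int((x[i] - x[j]) * (y[i] - y[j]) < 0)
            for a, i in enumerate(subset) for j in subset[a + 1:]]


def footrule_components(x, y, subset):
    """|pi(x_i) - nu(y_i)| for i in subset, with ranks taken in the supplied sample."""
    pi, nu = ranks(x), ranks(y)
    return [abs(pi[i] - nu[i]) for i in subset]


x = [0.1, 0.4, 0.5, 0.9, 0.2, 0.7]
y = [0.3, 0.2, 0.6, 0.8, 0.9, 0.1]
subset = [0, 1, 2, 3]
x_sub = [x[i] for i in subset]
y_sub = [y[i] for i in subset]

# Discordances agree across raw values, full-sample ranks, and the subsample alone.
print(discordances(x, y, subset),
      discordances(ranks(x), ranks(y), subset),
      discordances(x_sub, y_sub, [0, 1, 2, 3]))
# Footrule components differ between the embedded and the stand-alone subsample.
print(footrule_components(x, y, subset),
      footrule_components(x_sub, y_sub, [0, 1, 2, 3]))
```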

The notion of homogeneous association based on Kendall’s distance differs from
that based on the Spearman’s footrule used for the CSF test. In this case the natural
null hypothesis should be a distribution depending only on Kendall’s distance.
Furthermore it should have the greatest entropy for a given value of Kendall’s tau
because this formulation would attribute as much variability as possible to the null
distribution, making it a conservative (least favorable) test (Lehmann and Romano,
2006). To construct a distribution that has this structure, simply sample from an
arbitrary copula, and then reorder the Y-values according to a permutation $\nu(Y)$
sampled independently from a Mallows model centered at the ranking $\pi(X)$ of
the X-values. Quite remarkably, any such process asymptotically leads to a Frank
copula. Proposition 4, based on Starr (2009), gives a precise statement.

Proposition 4. Let $\{(X_i, Y_i) \mid i = 1, \dots, n\}$ be independent samples from a
distribution H with continuous marginals F and G, and associated copula C with
continuous partial derivatives. Let $\pi(\mathbf{X})$ be the ranking of $\mathbf{X} = [X_1, \dots, X_n]$
and $\nu(\mathbf{Y})$ be the ranking of $\mathbf{Y} = [Y_1, \dots, Y_n]$. Assume that for all n sufficiently
large, the conditional distribution of $\nu(\mathbf{Y})$ given $\pi(\mathbf{X})$ is Mallows, with center at
$\pi(\mathbf{X})$ and scale $\phi_n$. If $\phi_n \to 0$, and there exists $\theta \neq 0$ such that

\[ n\left(1 - e^{-\phi_n}\right) \to \theta, \]

then C is the Frank copula $C_\theta$.


Proof. First, we establish that if the conditional distribution of $\nu(\mathbf{Y})$ given $\pi(\mathbf{X})$
is a Mallows distribution, then the copula C is radially symmetric. The
pseudo-observations for each pair $(X_i, Y_i)$ are defined as functions of the pair and the
empirical margins

\[ (\hat{U}_i, \hat{V}_i) = \frac{n}{n+1}\left(F_n(X_i), G_n(Y_i)\right). \]

These are functions of the rankings $\pi(\mathbf{X})$ and $\nu(\mathbf{Y})$:

\[ \hat{U}_i = \frac{\pi(X_i)}{n+1}, \qquad \hat{V}_i = \frac{\nu(Y_i)}{n+1}. \]

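By the symmetry of the Mallows model, the joint distribution of the
pseudo-observations $(\hat{U}_1, \hat{V}_1), \dots, (\hat{U}_n, \hat{V}_n)$ is identical to the joint distribution of
$(1 - \hat{U}_1, 1 - \hat{V}_1), \dots, (1 - \hat{U}_n, 1 - \hat{V}_n)$. Consider empirical distributions based on
these observations (Genest and Nešlehová, 2014):

\[
\begin{aligned}
C_n(u, v) &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{\hat{U}_i \le u, \hat{V}_i \le v\}, \\
D_n(u, v) &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{1 - \hat{U}_i \le u, 1 - \hat{V}_i \le v\}.
\end{aligned}
\]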


Since H has continuous marginals and C has continuous partial derivatives,
Fermanian et al. (2004) established that $C_n$ is a consistent estimator of the copula
C, and likewise $D_n$ is a consistent estimator of the survival copula $\hat{C}$, where

\[ \hat{C}(u, v) = u + v - 1 + C(1 - u, 1 - v). \]

Because $C_n$ and $D_n$ have the same distribution, their limits coincide.
Hence, $C = \hat{C}$, which implies that the copula C is radially symmetric (Nelsen 2006,
pg. 37). Since C is radially symmetric, an asymptotically equivalent definition of
the empirical copula is


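\[ C_n(u, v) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left\{\frac{\pi(X_i)}{n} \le u, \frac{\nu(Y_i)}{n} \le v\right\} = \frac{1}{n}\sum_{i=1}^{n} \delta_{(\pi(X_i)/n,\, \nu(Y_i)/n)}\left([0, u] \times [0, v]\right), \]

which places mass of $1/n$ on each random point $(\pi(X_i)/n, \nu(Y_i)/n) \in \mathbb{R}^2$. This empirical
copula is expressed by the following point process (Starr, 2009): For $n \in \mathbb{N}$,

\[ \mu_n(B, \omega) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left\{\left(\frac{\pi(X_i)}{n}, \frac{\nu(Y_i)}{n}\right) \in B\right\} \]

for each bounded Borel set $B \subset \mathbb{R}^2$.

By assumption, the regularity conditions on the Mallows scale are satisfied as
$n \to \infty$:

\[ \exists\, \theta \in \mathbb{R} \setminus \{0\} : \quad n\left(1 - e^{-\phi_n}\right) \to \theta. \]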


Under these conditions, the primary result of Starr (2009) is applied: As $n \to \infty$,
the random measures $\mu_n(\cdot, \omega)$ weakly converge to the measure $\mu_\theta$, defined by

\[ d\mu_\theta(u, v) = \frac{(\theta/2)\sinh(\theta/2)}{\left(e^{\theta/4}\cosh(\theta[u - v]/2) - e^{-\theta/4}\cosh(\theta[u + v - 1]/2)\right)^2}\, \mathbf{1}_{[0,1]^2}(u, v)\, du\, dv. \]


Simply converting the hyperbolic functions to exponential form and simplifying
yields

\[ d\mu_\theta(u, v) = \frac{\theta\left(1 - e^{-\theta}\right) e^{-\theta(u + v)}}{\left(1 - e^{-\theta} - (1 - e^{-\theta u})(1 - e^{-\theta v})\right)^2}\, \mathbf{1}_{[0,1]^2}(u, v)\, du\, dv. \]

By recognition, the limiting measure $d\mu_\theta$ is that of the (Frank) copula $C_\theta$. Recall,
$C_n$ is a consistent estimator of the underlying copula C, and converges weakly to
$C_\theta$, so we conclude that $C = C_\theta$. □
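The simplification step in the proof can be checked numerically. The sketch below is an illustration with hypothetical function names: it evaluates the hyperbolic form of the limiting density and the Frank copula density at a few points and compares them.

```python
import math


def starr_density(u, v, theta):
    """Limiting density in hyperbolic form."""
    numerator = (theta / 2) * math.sinh(theta / 2)
    denominator = (math.exp(theta / 4) * math.cosh(theta * (u - v) / 2)
                   - math.exp(-theta / 4) * math.cosh(theta * (u + v - 1) / 2))
    return numerator / denominator ** 2


def frank_density(u, v, theta):
    """Frank copula density in exponential form."""
    numerator = theta * (1 - math.exp(-theta)) * math.exp(-theta * (u + v))
    denominator = (1 - math.exp(-theta)
                   - (1 - math.exp(-theta * u)) * (1 - math.exp(-theta * v)))
    return numerator / denominator ** 2


for u, v, theta in [(0.2, 0.7, 3.0), (0.5, 0.5, 1.0), (0.9, 0.1, -2.0)]:
    print(u, v, theta, starr_density(u, v, theta), frank_density(u, v, theta))
```

The two expressions agree because the exponential-form denominator equals $2 e^{-\theta/4} e^{-\theta(u+v)/2}$ times the hyperbolic-form denominator, and the resulting factor cancels against the numerator after squaring.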


Pursuing this result further allows for inspection of the adequacy of the
asymptotic result for finite samples. A function $\phi(\theta)$ for matching the Mallows $\phi$
parameter to the Frank $\theta$ parameter may be obtained by equating expressions for
$\tau_\phi$ and $\tau_\theta$ from these models. For any Archimedean copula, there is a relatively
simple formula $\tau = 4E[C(U, V)] - 1$ (MacKay and Genest, 1986); for the Frank
copula, a specialized form (Nelsen, 2006, p. 171; Genest, 1987) is

\[ \tau_\theta = 1 - \frac{4}{\theta}\left[1 - D_1(\theta)\right], \qquad D_1(\theta) = \frac{1}{\theta}\int_0^{\theta} \frac{t}{e^t - 1}\, dt, \]

where the scaled integral $D_1(\theta)$ is known as the Debye-1 function, available in the
“gsl” (GNU Scientific Library) package of R. For the Mallows model,

\[ \tau_\phi = \frac{2}{\pi}\arctan(0.18\, n \phi). \]

Equating $\tau_\theta$ and $\tau_\phi$ leads to the relationship

\[ \phi = \frac{100}{18 n}\tan\left(\frac{\pi}{2}\, \tau_\theta\right) \approx \frac{0.9694\, \theta}{n}. \]
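The matching of scales can be sketched numerically as follows. The paper points to the R "gsl" package for the Debye function; the version here substitutes a plain Simpson's-rule quadrature so that it depends only on the standard library. The function names are illustrative, and the arctan relation is taken as given from the text above.

```python
import math


def debye1(theta, steps=2000):
    """D_1(theta) = (1/theta) * integral from 0 to theta of t / (e^t - 1) dt."""
    def integrand(t):
        return 1.0 if t == 0 else t / math.expm1(t)

    h = theta / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(theta)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3 / theta


def frank_tau(theta):
    """Kendall's tau of the Frank copula with parameter theta."""
    return 1 - 4 / theta * (1 - debye1(theta))


def mallows_phi(theta, n):
    """phi solving (2 / pi) * arctan(0.18 * n * phi) = tau_theta."""
    return math.tan(math.pi * frank_tau(theta) / 2) / (0.18 * n)


for theta in (0.5, 1.0, 3.0):
    print(theta, frank_tau(theta), 1000 * mallows_phi(theta, 1000))
```

For small $\theta$, $\tau_\theta$ is close to $\theta/9$, so $n\phi$ is close to $(100/18)(\pi/18)\,\theta$, in line with the approximate proportionality stated above.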


Empirical evidence for the applicability of Proposition 4 comes in two stages: 1)
The distributions of Kendall’s Distance under Frank($\theta$) and under Mallows($\phi(\theta)$)
both converge to the same normal distribution; 2) As n gets large the product
density of the sample under Frank($\theta$) converges to an increasing function of the
Kendall’s Distance between $\pi(\mathbf{X})$ and $\nu(\mathbf{Y})$ of the sample. Figure 4 illustrates
results from the following confirmatory experiment:

• Generate 1000 sets of 1000 points from a Frank($\theta$ = 3) copula

• Compute the Kendall distance D and the Frank density d for each set

• Plot d vs D on a log-log scale
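A scaled-down sketch of the first two steps of this experiment is given below, using 50 sets of 200 points rather than 1000 sets of 1000 so that a pure-Python, quadratic-time Kendall distance stays fast. Sampling uses conditional inversion, and the density is accumulated on the log scale. These choices and all names are assumptions made for illustration. Plotting is omitted.

```python
import math
import random


def sample_frank(n, theta, rng):
    """Draw n points from a Frank(theta) copula by conditional inversion."""
    points = []
    for _ in range(n):
        u, w = rng.random(), rng.random()
        a = w * math.expm1(-theta) / (math.exp(-theta * u) - w * math.expm1(-theta * u))
        points.append((u, -math.log1p(a) / theta))
    return points


def kendall_distance(points):
    """Number of discordant pairs (O(n^2) for clarity)."""
    n = len(points)
    return sum((points[i][0] - points[j][0]) * (points[i][1] - points[j][1]) < 0
               for i in range(n) for j in range(i + 1, n))


def frank_log_density(points, theta):
    """Log of the product of Frank densities over the sample."""
    total = 0.0
    for u, v in points:
        denom = (1 - math.exp(-theta)
                 - (1 - math.exp(-theta * u)) * (1 - math.exp(-theta * v)))
        total += (math.log(theta * (1 - math.exp(-theta)))
                  - theta * (u + v) - 2 * math.log(denom))
    return total


rng = random.Random(0)
theta, n_sets, n_points = 3.0, 50, 200
results = []
for _ in range(n_sets):
    sample = sample_frank(n_points, theta, rng)
    results.append((kendall_distance(sample), frank_log_density(sample, theta)))
print(results[:3])
```

Each entry of results pairs the Kendall distance D of one set with its log density, which are the quantities plotted against each other in the experiment.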

Note also that the Frank copula is radially symmetric, C(u, v) = u + v - 1 +
C(1 - u, 1 - v), which is a necessary condition for the density of a sample to depend
only on its Kendall distance. With the assurance that there are copulas with
the conditional distribution of $\nu(\mathbf{Y})$ given $\pi(\mathbf{X})$ well approximated by a Mallows
model, this becomes the null hypothesis:

\[ H_0 : \nu(\mathbf{Y}) \circ \pi^{-1}(\mathbf{X}) \sim \text{Mallows}(\theta), \text{ for some } \theta > 0. \]

Figure 3. Scale relationships. Left: Kendall’s $\tau$ vs. the Frank
scale $\theta$; Center: Kendall’s $\tau$ vs. the Mallows scale $\phi$ for n =
100, 1000, 5000, 10000; Right: Mallows $\phi_n$ vs. Frank $\theta$, for n =
100, 1000, 5000.

Figure 4. Association between X and Y in a Frank copula approaches
a Mallows model for the ranking $\nu(\mathbf{Y})$ in a large sample
centered at $\pi(\mathbf{X})$. Left: Approximate normality of Kendall’s Distance
D in samples of size n = 1000 from a Frank($\theta$ = 3) copula;
Right: log-log plot of the density of each sample vs. its Kendall’s
distance D.

The general alternative against which we would like a test to be sensitive is that
there is a subpopulation with high association with the remainder having (little or)
no association. The test for heterogeneity should maintain power over a wide variety
of alternative distributions for the subpopulation supporting strong association.
With these considerations, the alternative hypothesis is formulated as

\[ H_a : (F(X), G(Y)) \sim M \]

where M is a mixture of two distributions: $H_1$ on which (F(X), G(Y)) are
independent and $H_2$ on which (F(X), G(Y)) have $\tau > 0$.

To test against such a general alternative, an adaptive model encompassing the
Mallows model is adopted, with the number of free parameters in the model
increasing with sample size. This Components of Kendall’s Tau (CKT) test proceeds
in four steps:

(1) Fit a Mallows model centered at $\pi(\mathbf{X})$ to $\nu(\mathbf{Y})$ and compute the likelihood.

(2) Reorder the data points $\{(X_i, Y_i) \mid i = 1, \dots, n\}$, so that Kendall’s tau
coefficient is decreasing. See Yu et al. (2011). Call the reordering $\sigma$.

(3) Smoothly fit a multistage ranking model to the relative rankings of
$[Y_{\sigma(1)}, \dots, Y_{\sigma(k)}]$ to $[X_{\sigma(1)}, \dots, X_{\sigma(k)}]$ at each stage k. See Sampath and
Verducci (2013). Compute the likelihood under this (encompassing) model.

(4) Use the (Generalized) Likelihood Ratio statistic to test $H_0$.

Comments on the four steps: 

(1) Since Kendall’s tau distance is invariant to reordering of observations, this
is the same as fitting a Mallows model, centered at ranking ($\sigma\mathbf{X}$), to the
ranking ($\sigma\mathbf{Y}$), where $\sigma$ is the taupath reordering.

(2) The idea of reordering is to put the points displaying the highest amount
of association earlier in the sequence in order to identify the subpopulation
with highest empirical association. The reordering is not unique. Yu et al.
(2011) discuss various algorithms.

(3) The multistage ranking model decomposes the number of discordances [up
to (n choose 2)] between ranking ($\sigma\mathbf{Y}$) and ranking ($\sigma\mathbf{X}$), as a sum of n - 1
variables $\{V_k\}$ with ranges {0, ..., k}, k = 1, ..., n - 1. The model has
likelihood $L = c(\theta)\, e^{-\sum_k \theta_k V_k}$, which reduces to the likelihood of the Mallows
model when all component parameters are equal.

(4) The conditions needed to justify an asymptotic chi-square distribution for
this statistic do not hold in this setting. Currently, we simulate the
distribution under the Frank copula to get an appropriate reference. We are
working to find a more precise characterization of the LR in this setting.
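The decomposition in comment (3) can be made concrete with a short sketch. It assumes that the points are already in the chosen order, that $V_k$ counts the discordances between point k + 1 and the k points before it, and that each $V_k$ enters the likelihood as an independent truncated-geometric factor on {0, ..., k}. This is one standard way to write a multistage ranking model, but the normalization is an assumption here rather than a detail stated in the text. The smoothing of the stage parameters and the taupath reordering are not shown.

```python
import math


def stage_discordances(x, y):
    """V_k for k = 1..n-1: discordances between point k+1 and the k points before it."""
    return [sum((x[k] - x[j]) * (y[k] - y[j]) < 0 for j in range(k))
            for k in range(1, len(x))]


def multistage_loglik(v, thetas):
    """log L = sum_k [ -theta_k * V_k - log sum_{j=0}^{k} exp(-theta_k * j) ]."""
    total = 0.0
    for k, (v_k, theta_k) in enumerate(zip(v, thetas), start=1):
        log_norm = math.log(sum(math.exp(-theta_k * j) for j in range(k + 1)))
        total += -theta_k * v_k - log_norm
    return total


def mallows_loglik(v, theta):
    """Mallows log-likelihood: the multistage model with all parameters equal."""
    return multistage_loglik(v, [theta] * len(v))


x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 5, 4]
v = stage_discordances(x, y)
print(v, sum(v), mallows_loglik(v, 0.7))
```

The stage counts sum to the Kendall distance of the sample, so with equal parameters the exponent collapses to $-\theta\, d_K$, which is the sense in which the Mallows model is nested in the multistage model.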

The null distribution of this likelihood ratio appears to be close to normal, with its
mean decreasing with the common correlation $\tau$, and standard deviation constant.
See Figure 5. Note that, for n = 1000, the variance of 2 · LLR ≈ 2500 is clearly less
than 2 · mean(2 · LLR), its theoretical value for a chi-square distribution, which is
in the range (3100, 3800) when $\tau \in (.10, .30)$.

Instead of fixed n and varying $\tau$, Figure 6 depicts the relationship between
LLR and n with fixed $\tau$ = 0.1. The overall relationship between the moments of
LLR and the parameters $\tau$ and n is not yet known, but using a practical additive
approximation in the range $.1 < \tau < .3$ and $500 < n < 3000$, the basic asymptotic
$\alpha$-level CKT test has the form: Reject $H_0$ if

\[ \frac{LLR - (n + 20 - 797\hat{\tau})}{0.02 n + 7} \ge z_{1-\alpha} \]

where $\hat{\tau}$ is Kendall’s correlation coefficient and $z_{1-\alpha}$ is the $(1 - \alpha)$th quantile of the
standard normal.
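The rejection rule translates directly into code. The sketch below takes the LLR as an input, since computing it requires the fitted multistage model, and applies the additive approximation only as stated, for $.1 < \tau < .3$ and $500 < n < 3000$. The function name is hypothetical.

```python
from statistics import NormalDist


def ckt_reject(llr, n, tau_hat, alpha=0.05):
    """Standardize the LLR with the additive approximation and compare to z_{1-alpha}.

    Returns the standardized statistic and the reject / do-not-reject decision.
    """
    z = (llr - (n + 20 - 797 * tau_hat)) / (0.02 * n + 7)
    return z, z >= NormalDist().inv_cdf(1 - alpha)


print(ckt_reject(920.0, 1000, 0.2))
```

For n = 1000 and $\hat{\tau}$ = 0.2 the approximate null mean is 1020 - 159.4 = 860.6 and the approximate null standard deviation is 27, so an observed LLR of 920 standardizes to 2.2.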





Figure 5. Simulation of Log Likelihood Ratio (LLR) Statistic for
CKT test under Frank Copulas. Left: Histogram of 500 simulations
of size n = 1000. Right: Decreasing pattern of mean and
constancy of standard deviation for LLR under Frank Copulas at
different levels of $\tau$ for n = 1000.


Simulations for Robustness and Power 

First, performance of the tests is checked by maintenance of levels under various
Gaussian and Frank copulas; subsequently power is examined. The CSF test is
based on the number of absolute rank differences less than .2. Figure 7 shows the
null distributions of p-values for the CSF test applied to samples of size n = 1000
generated 100,000 times under the Gaussian($\rho$) models. These distributions start
to become stochastically smaller than uniform for $\rho > 0.45$. Otherwise the test is
conservative in the range $0 < \rho < 0.45$ and $0 < \alpha < 0.05$ as illustrated by the
observed number of type 1 errors at the $\alpha$ = 0.05 level.

Under similar Gaussian copulas, the adjustment of the mean of the log-likelihood
for the estimated overall $\tau$ makes the CKT test behave conservatively for large
values of $\rho$, but gives highly significant values for $\tau$ values near 0. See Figure 8,
in which, due to computational limitations, lowess-smoothed curves describe the
p-distribution based on only 100 simulations. In the presence of very low overall
correlation, it is advisable to use the CSF test as a screen for the CKT, which will
protect the CKT from finding uneven levels of $\tau$ association when $\rho$ association
is homogeneous. Again, this tendency toward excess false positives happens only
when the overall $\rho$ association is close to 0. In this case a special test (Sampath
and Verducci, 2013) is available for the null hypothesis of independence. Under a
Frank copula, the CSF test behaves properly near independence, but loses its level
when $\tau$ gets large. See Figure 9.

Several factors affect the power curves of both the CSF and CKT tests: sample
size (n is fixed at 500 or 1000); proportionate size of the subpopulation (fixed at
40%); strength of association in the subpopulation ($\rho, \tau \in \{.7, .8, .9\}$); and, most
importantly, the form of the subpopulation. Against the null hypothesis of a Gaussian
copula, the alternative is a mixture of copulas, where the variables are assumed to
be independent in the complement of the subpopulation. Against the null
hypothesis of a Frank copula, the subpopulation is selected at random and its
conditional distribution is forced into a stronger Mallows model. This allows the
population margins to remain uniform while possibly restricting the range of the
subpopulation.

Figure 6. CSF results from 100,000 simulation experiments of
size n = 1000 for values of $\rho$ in a Gaussian copula. Left: Distribution
of p-values for 7 values of null correlation; Right: Observed
probabilities of Type I error for a nominal $\alpha$ = 0.05 test. The test
is conservative for values of $\rho < 0.45$.

Figure 7. CKT results from 200 simulation experiments of size
n = 1000 for values of $\rho$ in a Gaussian copula; Left: Distribution
of p-values for 7 values of the null correlation; Right: Observed
probabilities of Type I error for a nominal $\alpha$ = 0.05 test. The test
is conservative for values of $\rho > 0.16$.

Figure 10 shows the distribution of p-values of both CSF and CKT tests against
40% Gaussian with $.12 < \rho < .13$. For this range of overall correlation the CKT
test holds its level and is conservative for overall correlation $\rho < .12$, which is the
case here. Nevertheless, it achieves perfect power when the subpopulation $\rho > .125$,
even though its power quickly diminishes to 10% for $\rho = .12$ in the subpopulation.
It also performs better than CSF in this range.

Figure 8. P-values against Alternative Mixture of Gaussian copulas
with sample size n = 1000 and subpopulation proportion
= 40%. Dashed horizontal lines at $\alpha$ = 0.05 indicate the power of
a level 0.05 test.

Under the Mallows alternative, n = 1000 points are generated from a uniform
distribution, 400 points are then sampled from a quantile range of x values and
the y values resorted according to a random draw from a Mallows($\phi(\tau)$) model.
Values of $\tau$ used are .4, .5, and .6. Figure 11 shows the distributions of p-values
from the $\alpha$ = 0.05 level CSF and CKT tests over 100 simulations. The left panel
corresponds to the subpopulation being sampled from the full range, while the right
panel corresponds to samples between the 20th and 80th percentiles of x-values.
The CKT test performs much better than the CSF test against these alternatives.
The CKT has essentially perfect detection when the subpopulation spans the whole
range, and at least 70% power in the 20-80 percentile range. The CSF has no power
in either scenario.
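One way to generate data of this kind is sketched below. The Mallows draw uses repeated insertion, in which stage k contributes a truncated-geometric number of inversions. The map from $\tau$ to $\phi$ reuses the arctan relation given earlier, with the subpopulation size as the number of ranked items. Both choices, and all function names, are assumptions made for illustration rather than the authors' stated procedure.

```python
import math
import random


def sample_mallows(m, phi, rng):
    """Permutation of 0..m-1 from a Mallows model centered at the identity."""
    perm = []
    for k in range(m):
        weights = [math.exp(-phi * v) for v in range(k + 1)]
        v = rng.choices(range(k + 1), weights=weights)[0]
        perm.insert(k - v, k)  # item k ends up ahead of v smaller items: v inversions
    return perm


def phi_for_tau(tau, m):
    """Assumed inverse of tau = (2 / pi) * arctan(0.18 * m * phi)."""
    return math.tan(math.pi * tau / 2) / (0.18 * m)


def mallows_alternative(n, m, phi, lo, hi, rng):
    """Uniform (x, y) sample in which m points from the [lo, hi] x-quantile range
    have their y values reordered by a Mallows draw; margins are unchanged."""
    x = [rng.random() for _ in range(n)]
    y = [rng.random() for _ in range(n)]
    by_x = sorted(range(n), key=lambda i: x[i])
    eligible = by_x[int(lo * n):int(hi * n)]
    chosen = sorted(rng.sample(eligible, m), key=lambda i: x[i])
    y_sorted = sorted(y[i] for i in chosen)
    perm = sample_mallows(m, phi, rng)
    for position, i in enumerate(chosen):
        y[i] = y_sorted[perm[position]]
    return x, y, chosen


x, y, chosen = mallows_alternative(1000, 400, phi_for_tau(0.5, 400), 0.2, 0.8,
                                   random.Random(0))
print(len(chosen), min(x[i] for i in chosen), max(x[i] for i in chosen))
```

Setting lo = 0 and hi = 1 corresponds to the full-range scenario, while lo = 0.2 and hi = 0.8 corresponds to the restricted one.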


Figure 9. Distribution of p-values for Mallows alternatives, based
on 100 simulations of sample size n = 1000. Left: Associated
subgroup spans the full range of x-values. Right: Span is restricted
to the 20-to-80th percentiles region of x-values in the larger
population.


Example

Wine cultivars are varieties of grapes that have been cultivated through selective
breeding. Different varieties may be characterized by certain chemical properties
of the wine they produce. Early work in supervised learning has been used to
classify wine cultivars using chemical measurements of wine samples (Aeberhard et al.,
1993). These data, available at (... databases/wine), are reanalyzed here using the
CKT and CSF tests as unsupervised methods of detecting different association
structures that might help characterize different cultivated varieties.

Figure 12 shows the relationship between flavonoids and phenols in the data
set consisting of 13 measurements from 178 wine samples derived from 3 different
cultivars. To the untrained eye, the overall plot looks typical of homogeneous
association, but both the CKT (p = .0002) and CSF (p = .027) indicate heterogeneity.
Identification of cultivars in the plot shows separation of cultivar 1 and 3 samples
from each other, with slightly negative association within each of these groups;
however, their positioning contributes a kind of ecological correlation to the overall
sample. In contrast, samples from cultivar 2 show a strong positive association
between flavonoid and phenol content. This suggests an underlying genetic difference.

It is impressive that CKT can detect this heterogeneity of association from the
unlabelled data, which looks like an overall positive association, part of which is
ecological correlation. Although the CSF test does also indicate association, it is
not as sensitive at detecting it in this situation, and its p-value would not present a
strong case for heterogeneity if any correction is attempted for multiple comparisons
over the 13 choose 2 (78) pairs of variables available.

Concluding Remarks

The ability to detect subpopulations that drive association has the potential of
changing the way statistics are used to unveil structures in “Big Data.” Instead
of employing extensive model searching with complex interaction, now relatively
model-free methods are available to ascertain with precision whether there is any simple
mixture that better explains monotone association between variables. The CSF
and CKT tests achieve this, either working together to screen and confirm or
separately to find different forms of the subpopulation that most strongly supports the
association.

These tests, however, are formally restricted to different forms of the meaning 
of “homogeneous association.” Strict legitimacy of the CSF test depends on the 
assumption of a Gaussian copula underlying the null distribution, whereas the CKT 
test depends on the assumption of a Frank copula underlying the null distribution. 
Although there is some evidence of limited robustness, much more work should be 
done to explore the behavior of these tests under general conditions. For example, 
both the Gaussian and Frank copulas are radially symmetric; it is unclear how 
sensitive the tests would be to asymmetric notions of homogeneous association. 

The computational efficiency of the CSF test is important because the sample
size n needs to be in the thousands before there is much hope of reliably detecting
these subtle but important differences. In contrast with the CSF test, the
justification of CKT is a bit more compelling, based on intrinsic association within the
subpopulation. We have been using CSF at a liberal $\alpha$ = 0.05 level as a screening
device to reduce the number of pairs of variables to be tested at a more stringent
level.

Detecting heterogeneity of association is a difficult task. Such detection is
practical only when the overall association is not too strong, the association in the
subpopulation is strong, and the sample size is large. Nevertheless, such scenarios
abound. We believe that these new methods will make Statistics ever more relevant
in making good sense from Big Data.